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Abstract 

We give a practical random mapping that takes any set of documents represented as vectors in 
Euclidean space and then maps them to a sparse subset of the Hamming cube while retaining ordering of 
inter-vector inner products. Once represented in the sparse space, it is natural to index documents using 
commercial text-based search engines which are specialized to take advantage of this sparse and discrete 
structure for large-scale document retrieval. We give a theoretical analysis of the mapping scheme, 
characterizing exact asymptotic behavior and also giving non-asymptotic bounds which we verify through 
numerical simulations. We balance the theoretical treatment with several practical considerations; these 
allow substantial speed up of the method. We further illustrate the use of this method on search over 
two real data sets: a corpus of images represented by their color histograms, and a corpus of daily stock 
market index values. 


1 Introduction 

Rapid document retrieval is a basic problem in modern large-scale computation. One is given a corpus of 
documents and a query, and the goal is to find documents in the corpus that nearly match the query. We 
consider the case where the query is itself a document and thus our problem is closely related to approximate 
nearest neighbor seareh. While this problem becomes challenging in the high-dimensional setting, in the 
special case of text documents it has been addressed, in practice, with success. Text documents may be 
represented as “bags of words” in which each document is represented as a vector, whose length is the size 
M of the dictionary, and whose i-th entry contains the number of times word i appears Q Commercial search 
engines have been built to take advantage of this sparse and discrete structure for large scale search. 

However, many documents of interest are best represented as vectors in [R^, which are not amenable to 
large scale search in their original form. A small set of examples includes images, video, and audio; biometric 
data such as fingerprints and iris scans; and bioinformatic data such as gene expressions, enzyme activities, 
and protein libraries. Given the high degree to which text-based search has been implemented, including the 
consequential familiarity of software engineers for text-based search infrastructure, we give a way to adapt 
this familiar infrastructure to search in 

Naive search for objects in R^ for d > 50 is prohibitively slow when the corpus is large. Indeed, as stated 
in [1], “Despite decades of intensive effort, the current solutions suffer from either space, or query time, that 
is exponential in dimension d”. See [21] for an overview. A typical solution is to apply a dimension-reduction 
technique, such as that espoused by Johnson Lindenstrauss Lemma m, which maps the data into a lower 
dimensional Euclidean space. While this has a beautiful theoretical backing, in practice Euclidean space is 
not ideally suited for search. Many other mapping strategies have been invented, in particular the loeality- 
sensitive hashing used in approximate nearest neighbour search [311]. These give theoretically appealing 
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^For example, the first sentence of the introduction would be represented as a vector with 12 entries equal to one and all 
other entries equal to zero. 
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low-complexity and low-storage guarantees, but are typically dissimilar in operation to text-based search, 
so cannot leverage existing search infrastructure, nor be naturally combined with text-based search. In this 
paper, we give an easy to use practical mapping which puts the data precisely in the form used by commercial 
text search engines, namely in high-dimensional but sparse subsets of or even of {0,1}"^. Mapping real 
valued vectors to discrete measurements is also the approach of locality sensitive hashing. However, in this 
paper we emphasize mapping to a sparse set to put the data into the same representation as a text document. 
We believe this falls outside of the perspective considered in prior literature. 


1.1 Random mapping scheme 

Although proving that our random mapping scheme works is involved, the scheme is remarkably simple. 
Our corpus A is a finite collection of vectors in [R^, normalized to have unit £2 norm. To transform each 
vector in A, multiply each vector by a random matrix, then threshold each element. We now formalize this 
procedure. 

Introducing notation we will use throughout, let ai,a 2 ,... ,a^ be standard normal random vectors of 
length d. Fix h > 0. Map x G A C to the Hamming cube {0,1}"^ as follows 


^ ^ iM{ai^x)>h])iLl‘ ( 1 - 1 ) 

Above, l[{ai,x)>h] is equal to 1 if {ai^x) > h and 0 otherwise. Note that x is mapped to a sparse vector 
provided h is large enough. 

After indexing each document in this manner, we search by performing the same transformation to query 
vector y G We then take inner products in the Hamming cube to determine the best match. Thus, we 

return x G A to the user in decreasing order of score. 


S {x,y) = 


1 

m 


m 

'^[{ai,x)>h]'^[{ai,y)>h]- 


( 1 . 2 ) 


The bulk of this paper is devoted to proving and demonstrating that with appropriate choices for parameters 
h and m, this procedure returns x G A closest, or nearly closest, to query y with high probability. 

The astute reader will already observe that the transformation itself is 0{md): Although the theory 
is straightforward to introduce with Gaussian, we elaborate in the section on practical considerations. 
Section]^ on an alternate choice for the random transformation that is O(mlogm). We show through simu¬ 
lations that this alternate strategy gives identical performance to that observed with a Gaussian transform. 

The particular case where h = 0, correponding with measurements of the form sign((ai, x)), was in¬ 
troduced as a method for locality-sensitive hashing in [5], and is well studied in general. Similarly, other 
authors [iniiiHj threshold to h = 0, but replace (a^, x) with an inner product with respect to a kernel tailored 
to the corpus A in order to gain greater fidelity in the hashed (transformed) space. Assuming A C 
the vanilla sign mapping has been shown to be a near isometry from the sphere with geodesic distance to 
the Hamming cube, even in the case when X has an infinite collection of elements [20]. Thus, it is a very 
effective locality-sensitive hash. Further, this near-isometric property is pivotal in 1-hit compressed sensing 
[3119]. In contrast, we focus on the case when h is much greater than 0. We show that, while the mapping 
is not a near isometry, it still does preserve ordering of inner products, i.e., if {x^y) > (z^y) then, with high 
probability, S (x^y) > S (z^y). This relationship is sufficient for effective document search and by relaxing 
the need for near-isometry we gain sparsity and thus improve search speed. As it happens, our choice of 
h > 0 also provides increased sensitivity to distinguishing the nearly-similar items we in practice expect to 
find among the top search results. 


1.2 Notation 

gd-i 5 ; 0 f 0 j.g ^p 0 Euclidean sphere in d dimensions; the corpus X C is a set of documents to be indexed 

(we assume each document has been normalized); x e X, y e always refer to a corpus document, and 
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a user query, respectively; /i(A) = ES{x^y) is the expected score of query y against document x given their 
true similarity as their inner product A = (x, y)] the vocabulary size m is a large parameter that we control. 

= P(A^(0,1) < t) is the cumulative distribution function of a standard normal random variable; 
ai,a 2 ,..., are standard normal vectors, each of length d; the threshold value h is parameterized by a 
scalar r so that h = h{m^ r) := \/2r logm. 

For two sequences of numbers, 61 , 62 ,... and / 3 i,;d 2 ,..., we say that bm is asymptotically equivalent to 
denoted bm ^ Pmi if and only if 

6m = /^m(l + ^(1)) 

where o(l) ^ 0 as m ^ oc. If Pm 7 ^ 0 foi* each m, then bm ^ Pm if and only if 

r V 1 

hm —-1. 

m^oo P^ 

Note that this (standard) equivalence relationship is preserved under addition, multiplication, and division. 

2 Problem definition 

As previously noted, there exists a great body of literature addressing nearest neighbor and approximate 
nearest neighbor search. We broaden this perspective slightly to address the more practical problem of 
producing a list of results, ordered by decreasing relevance. We therefore introduce notions of relevance, 
retrieval, and errors in order to use standard information retrieval performance metrics (see [15] for an 
overview). 

Recall that /i(A) := ES{x, y) provided (x, y) = A. Thus, with a goal of retrieving documents whose inner 
product with y is at least A, we will return any document whose score exceeds /i(A). 

Definition 2.1 (Document sets). Fix A G [—1,1], ^ G 8 ^“^, and x G A. We call the document x retrieved 
if and only if S (x^y) > /i(A). We call the document x relevant if and only if {x^y) > A; a document that is 
not relevant is irrelevant. 

While purely relevant documents are appealing, in practice the mapping will cause some (small) error. 
Thus, we replace the notion of relevant with e-relevant. 

Definition 2.2 (Document sets with small error). Fix A G [—l,l],e > 0, ^ G 8 ^“^, and x E X. We call 
the document x e-relevant if and only if (x, ^) > A + e; we call the document x e-irrelevant if and only if 
{x,y) < X-e. 

Borrowing notions from hypothesis testing, we identify two important events: 

Definition 2.3 (Error events). We define two types of events: If an e-irrelevant document is retrieved this 
is called a type I error. If an e-relevant document is not retrieved this is called a type II error. 

See Figure for an illustration. 

The practitioner will have two main goals: 

1. Minimize the complexity of retrieval in space (precomputation, also called indexing) and time (search) 
by making the mapping as sparse as possible; and 

2. Minimize e while still controlling the probability of type I errors and type II errors. 

Not surprisingly, there is a tradeoff between those two goals. Our main theoretical contribution, given in the 
next section, is to precisely characterize the sparsity and the size of e as a function of adjustable parameters 
and given the desired bound on type 1 errors and type II errors. 
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Type I Error occurs if irrelevant 
document is retrieved 


Type II Error occurs if relevant 
document is not retrieved 


A - c \ A+ E 


Figure 1: Diagram of kinds of errors. In the area between A — e and A + e, we do not expect documents to 
be returned reliably. We give a precise characterization of the size of e required to keep the probability of 


type I errors and type II errors small in Section 3.3 


3 Theoretical results 

This section states results concerning sparsity of our mapping and asymptotic and non-asymptotic statements 
of its accuracy. These results anticipate the roles of parameters h and m in the tradeoff between accuracy 
and sparsity (hence, complexity) elucidated in Section]^ 


3.1 Sparsity of the map 

We begin by controlling the expected sparsity of mapped vectors. Recall that each vector x G T is mapped 
to the sequence '^[{ai,x)>h]i'i^ = 1,2 ,... ,m. Observe that (ai^x) is standard normal, and thus the expected 
number of non-zero components of a mapped vector is mP(A^(0,1) > h) = m{l — Recall that 

h = y/2r \ogm. For large m, which implies large h, one may use the classic [7] approximation 

= (3.1) 

Note that the right-hand side is not only asymptotically accurate, but is also a non-asymptotic upper bound 
on P(A^(0,l)>h) [7]. 

To conclude, let k be the number of non-zero entries in a mapped vector. Then k satisfies 


Ek = m{l — ^(A/2rlogm)), 


Ek 


m" 


\/4Prlog7n ’ 


and 


Ek< 


m" 


\/47rr logm 


(3.2) 


Thus, one sees that after ignoring logarithmic factors there are roughly non-zero entries. In fact, it is 

for this simple expression, and similar expressions below, that we parameterize h as h = ^/WTogrn. Indeed, 
in Lemma [3T| below, we see that r naturally characterizes a phase transition. 

We now proceed to the more significant part of the theory: characterizing the size of the error. This will 
come from controlling the concentration of the score around its mean. We begin in the asymptotic regime 
to give an exact result and a useful intuition. 


3.2 Asymptotic characterization of the score 

We begin with the following lemma, concerning the asymptotic normality of S{x^y). 

Lemma 3.1. Fix x^y e with {x^y) = A < 1. Let y = y{X) and = cr^(A) be, respeetively, the 

expeetation and varianee of l^(^ai,x)]'^[{ai,y)]' Depending on there are two eases to eonsider. 

Case 1: A < 2r — 1. 

Then 

P{S{x,y) 7^ 0) < y{X)m ^0 as m ^ oc. 

Case 2: A > 2r — 1. 
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The normalized seore 


(3^3) 


(J 

eonvergenees to a standard normal random variable in distribution as m ^ oo. 

The phase transition at A = 2r — 1 can be intuited by the following observations. When A < 2r — 1, the 
expected number of non-zero summands in S{x,y) converges to 0. When A > 2 r — 1 , the expected number 
of non-zero summands converges to infinity, and the score exhibits Gaussian behaviour. Let us derive this 
precisely. To ease notation, here and below let (a^, x) = Wi and (a^, y) = Vi. Let w^v he standard normal with 
covariance A so that Wi^Vi are independent copies of w,v. Consider EmS{x,y) = m/i, that is, the expected 
number of non-zero summands in the score. To control this quantity, we need the following bivariate normal 
tail approximation, which can be derived from [ 22 ] . 


M = = 1) ~ (-1^) 

Thus, the expected number of non-zero summands is 


C(A) 


_ 2r 

m 1 +^ 
2 r log m ’ 


C(A):= 


ji+yy 

27 r\/l - A 2 ' 


(3.4) 


A-(2r-l) 

EmSix.y) = my ^ -. (3.5) 

2 r log m 

It is now clear that this quantity converges to infinity for A > 2r — 1 and converges to 0 for A < 2r — 1. This 
latter observation, combined with Markov’s inequality, already completes the proof of the lemma in Case 1 . 
Indeed, Markov’s inequality shows that 


P{S{x,y) 7 ^ 0) = P{mS{x,y) > 1) < EmS{x,y) = my{X) 0. (3.6) 


The Gaussian behaviour of Case 2 does not follow from the vanilla central limit theorem since the summands 
depend on m (through h), but nevertheless is proven as an application of a Berry-Esseen approximation. 
We state this approximation, and complete the proof, in Appendix |A. 1 | below. 


3.3 Main theoretical results 

We now leverage Lemma [3.1| to determine the efficacy of search via the sparsifying transformation of this 
paper. In the following theorem we give a precise asymptotic characterization of e, which vanishes as m 
increases. Below, 77 is a parameter which controls the expected number of errors. 

Theorem 3.2 (Main theorem, asymptotic version). 

77 > 0. Let 

A-(2r-l) 

e = e(A, 772 , 77 , r) = C(A, r, 772 , 77 ) 771 2 (i+a) ^ 

Then, 


Fix 7 / G ^5 A G (2r — 1,1) also satisfying A > 0 and 


rif\ \ (1 + A)(l — A^)^/^ 

C(A, r, 772 , 77 ) := - - 77 . (3.7) 

\/2r 102:772 


EiNumber of type I errors) + EiNumber of type II errors) ^ / ^ x 

hm sup —^^^^ = P(A/(0,1) > 77 ). 


Thus, by taking 77 = 2v^log72, we have 


hm sup 


E{Number of type I errors) + E{Number of type II errors) 

n 


< 


1 

log 72 
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The last inequality follows from the Gaussian tail bound (|3.1|). Note that the last inequality combined 

1 


n\/87T log n 


o(l), there are no type I 


with Markov’s inequality implies that, with probability at least 1 
errors or type II errors. 

It may be helpful to write this in a different way. Define the (good) event G := {For every x ^ X 


if 


{x^y) > A + e, then S{x^y) > /i(A); if {x^y) < A —e, then S{x^y) < /i(A).} Then the theorem, combined with 
Markov’s inequality, implies that 


lim sup P(G^) < nP{N{0, 1 ) > 77 ). 

^^^|A'|=n 

We pause to remark on how this result falls into the framework of approximate nearest neighbour search. 
Note that vanishes as m ^ oc and thus, asymptotically, the search returns precisely the documents 
with the desired level of inner product with y. However, for large, but finite m, exact recovery is not 
expected. Instead, there is a small interval around A, and if a document falls into this interval, we cannot 
predict whether it will be correctly returned (or not returned). Outside the interval, documents are returned 
precisely as desired with high probability. Our main result above characterizes the size of this interval. This 
is a version of an approximate nearest neighbours solution [2] . 

In order to give a more traditional treatment of the nearest neighbor problem, one may be interested 
in just keeping the document with highest score, rather than keeping all documents whose scores exceed 
a certain threshold. The theorem may be leveraged to describe the accuracy of this method. Let xq be 
the vector in X which is closest to y, and let Aq = {xo^y). While the document with highest score is 
not guaranteed to be xq, it is guaranteed to have inner product with y nearly as high as Aq. The easiest 
way to see this is to make a slight change in the definition of G by switching A with A — e. Thus, let 
e = e(Ao, m, r, rf) be defined as in Equation (3.7) and define the event G' := {For every x G T , if (x, > Aq, 


then S{x,y) > //(Aq — e); if {x^y) < Aq — 2e, then S{x^y) < //(Aq — e).} One can see, by tweaking the proof, 
that the theory still holds with G replaced by G'. If the event G' holds, then S{xQ,y) > /i(Ao — e), and, for 
any x G X with {y, x) < Xq — 2e, one has 5'(x, y) < /i(Ao — e). Thus, under this event, which holds with high 
probability, the document with highest score must have inner product with y at least Aq — 2e, i.e., it is an 
approximate nearest neighbour. 

In the above theorem, we focus on asymptotics in order to give a rule of thumb; in particular, note that 

A-(2r-l) 

the main term quantifying the rate at which of e decreases is m 2 (i+a) ^ However, we find in numerical 
simulations that m must be quite large to realize the expected asymptotic behaviour. Thus, we present the 
following non-asymptotic version of this theorem, with a more complex expression to quantify the interval, 
but one that may be verified in numerical simulations even for modestly large m. 

For use in this theorem, we make a slight change in the definition of errors, allowing a different value of 
e for type I errors versus type H errors. 

Definition 3.3 (Error events). We define two types of events: If an e~-irrelevant doeument is retrieved this 
is ealled a type I error. If an -relevant doeument is not retrieved this is ealled a type H error. 

Theorem 3.4 (Main theorem, non-asymptotic version). Fix y G A G (2r — 1,1) and 77 > 0. Define e~ 
and to satisfy 


cr(A — e~) 




IJ.{X) - /x(A + e+) 
cr(A + e+) 


m = —T] 


(3.8) 


provided there are solutions. 

If there are solutions, then 


E(Number of type I errors) + E(Number of type II errors) \ \ 

sup- P{N{0, 1) > 77 ) 

|A’|=n ^ 


< 


\/l^{X-e ) 


m 


For sufficiently large tu. Equation (3.8) has solutions, and furthermore, e , e+ 0 (see Remark |A.6| 
below). 
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Remark 3.5 (Understanding the accuracy of the Gaussian approximation). The bound on the aeeumey 
of the approximation 1 //i(A — e~)m has an intuitive meaning, whieh is further verified in simulations. 
Indeed, note that fi{X — e~)m is roughly the expeeted number of non-zero summands in the seore 

when {x,y) ~ A (see Equation ( |3.5[ ) ). If this value is very small, one expeets the seore to be approximated 
by a diserete Poisson distribution, rather than the eontinuous Gaussian distribution given in the theorem. 
In this sparser regime, we find that the aeeuraey oseillates above and below what is expeeted by the Gaussian 
approximation as m is inereased. This effeet is well studied in m- 


4 Practical considerations 

Practitioners implementing our sparse mapping scheme need to consider tradeoffs between complexity and 
accuracy. In this section, we not only illustrate how the results of the previous section inform practical 
design decisions that must be made when incorporating sparse mapping into existing infrastructure, but we 
also address the cost of performing the mapping itself, proposing the use of a structured random matrix 
over the Gaussian random matrix used in our analysis thus far. Although the principal advantage of our 
approach is that it can be implemented on standard search engine infrastructure, we also illustrate here the 
marked improvement in complexity over an exhaustive search. Nevertheless, we emphasize that optimizing 
the precise complexity-accuracy tradeoff is not our main goal in this paper; instead the goal is to map to a 
space utilizable by commercial search engines. 


4.1 Complexity-accuracy tradeoffs 

Previous sections point to the tradeoff between search complexity and accuracy. Search complexity is de¬ 
termined by the amount of data provided to the two processes - indexing and searching - undertaken by 
the search engine. Accuracy is determined by the loss of fidelity in our random transform of documents and 
queries. Here we make explicit the relationship between parameters m and r and their combined effect on 
complexity and accuracy. 


Indexing cost In the indexing step, the search engine pre-processes the documents into a data structure 
suitable for efficient searching. The indexer creates a posting list for each term it encounters, and appends 
to each posting list the pointers to documents containing that term. In our case, each term is an element of 
the m-dimensional output of the random projection; documents exceeding our threshold h in a dimension 
are added to its respective posting lisfj^ 

The cost of storing the search index is dominated by the number elements appearing in each of the term 
posting lists. Equivalently, this is the number of unique terms in each document multiplied by the number 
of documents. After our transformation, sparsity estimates (3.1) indicate that indexing the entire corpus, 
size |A| = n, has an expected storage cost of 


0{nk)=o(^^lL). (4.1) 

V ^/r log m J 

That is, r determines the rate at which indexing complexity increases with m. Figure is a plot of the 
relationship between m and Ek for various r. 


Search cost Sparsity in the transformed space also determines search time complexity. We first bound 
worst-case complexity. The search engine must expect to examine the k posting lists indicated by the query 

^Readers familiar with search technology will note that a posting list can furthermore store the number of times the term 
appears in each document. For simplicity, we do not exploit this capacity, instead preferring to increase m, hence the number 
of terms we index. 
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Sparsity js/Xog m 




10’^ 

10" 10^ lO'^ 10^ 10® 

vocabulary size: m 

(b) Structured mapping (simulated) 


Figure 2: Expected posting list length, or sparsity, showing the effect of r on the rate of sparsity increase 
with m. Note that since r > 0, the expected number of posting lists in which each document appears is in 
sublinear in m. 


vector (in transformed space). Each of these lists has length bounded by n, and hence each search must 
examine 

EO{nk) = O 

\ V 2r log m J 

documents. However, in practice one would not expect each mapped vector in the corpus to have precisely 
the same support as the mapped query vector, and thus each list size would be much smaller. 

We now give a rough average-case complexity. In practice, we find most data sets are clustered, i.e., there 
are a cluster of corpus elements close to the query, and the rest are nearly orthogonal to the query. Indeed, 
in image search, most images in a data base have nothing to do with the query, and unrelated (random), 
high-dimensional, vectors tend to be nearly orthogona]^ Thus, as an approximation, consider the case when 
ni of the corpus elements are near to the query, and n — rii corpus elements are orthogonal to the query. 
Eurther, note that when the query vector and the corpus vector are orthogonal, the random mappings are 
independent. Thus, for each of the n — ni corpus elements orthogonal to the query, the probability that 
this element contributes to a posting list is bounded hy EO{k/m). It follows that each posting list has an 
expected length bounded by EO{{n — ni)k/m) + EO{ni). If n ^ nim/k^ the first term dominates. Then, 
since there are k posting lists, each search must examine an expected 

EO{nk^lm) = o( (4.2) 

\ r log m J 


documents. Here is where we see an advantage over the exhaustive comparison of the query against each of 
n documents, an 0{nd) procedure. 

Einally, to speed up search, we suggest using a larger value of h to generate queries than that used 
to index the corpus. This makes query vectors sparser than the Ek calculated above. This modification 
leverages engineering design choices consequential to an asymmetry in text-based search, where queries tend 
to have many fewer terms than the text documents they retrieve. Eor simplicity, we do not address this 
scenario in our section on theory (although we believe it would be straightforward to adjust our theory to 
this setting). However, our search experiments of Section]^ use larger h for search query transformation than 
for indexing with no loss of fidelity, with significant improvements in search-time performance. 

Of course, the costs of indexing and search are determined not only by transformed document sparsity, 
but also by the cost of doing the transformation. In the case of our transformation by a dense Gaussian 
matrix, this is the cost of a matrix multiplication, 0{md). We can make this cost sublinear in m if we use 


a structured random matrix to transform our corpus vectors: see Section 4.2, below. 


^The inner product between random high dimensional vectors concentrates very close to zero M- 
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0.2 


Error: ?j = 1.645, A =0.9 



vocabulary size: m 


Figure 3: Relationship between accuracy and m for various parameters r. Two curves for each value of r are 
shown, the upper and lower curves representing and e“, respectively. The choice r] = 1.645 corresponds 
to a 5% probability that we encounter a type I or type II error. 


Accuracy We focus our discussion of accuracy on the rate at which we erroneously return documents that 
do not match the query. The essential result is that the interval [A — e“,A + e+] shrinks as m increases 
at a rate that depends on r. Although given implicity by ( |3.8[ ), the relationship between this confidence 
interval and m is approximately ^ 0{vm7^). (Up to logarithmic factors, this heuristic matches 

the asymptotic theoretical rate of decay of e given in Theorem |3.2| for A approaching 1.) Figure gives an 
example relationship between and m for A = 0.9. Naturally, accuracy in this sense increases with m, 
although contrary to cost estimates, does so more rapidly with decreasing r. 


4.2 Fast mappings using structured matrices 


Because a Gaussian random matrix A G is dense and unstructured, transforming each corpus document 

and each query is 0{md). To speed this up, we take advantage of the growing literature on structured 
random mappings, which suggests that structured random matrices, which allow fast transforms, behave 
similarly to Gaussian matrices. In particular, m shows that the metric-preserving property of the Johnson- 
Lindenstrauss Lemma can be achieved via multiplication by a random diagonal matrix followed by a (fast) 
discrete Fourier transform. Thus, we try a similar fast, structured mapping, and show through numerical 
simulations that it behaves much the same as a Gaussian matrix, provided that m is large enough, and the 
documents are not overly sparse. 

The structured mapping we propose is in two steps. First, we apply a random but fast linear mapping 
of our vector x G to an intermediary u G Then, we perform a discrete cosine transform on 

thresholding each element of the output by h > 0, just as we did for the Gaussian transform. 

In our experiments, the specific transform D we choose to expand u = Dx creates m/d copies of x, then 
randomly changes the sign of each element. That is, we may represent this 0{m) operation as multiplication 
by matrix 


Di 

D2 


(4.3) 
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where each block Dr G is a diagonal matrix of random ±1. We have restricted m^d such that d is a 

divisor of m. 

To intermediary we apply a normalized type 2 discrete cosine transform (DCT-II), which we can 
represent as v = Fu, or specifically, 


i=i 


7r(z-l)(2j-l) 

2m 


i = 1.. .m 


Ci = 


1, i = 1 
\/2, otherwise 


(4.4) 


This is an O(mlogm) transform, just as is the complex discrete Fourier transform. The normalization factor 
we choose for F in ( |4.4[ ) ensures that E = m, just as E = m where A e is the Gaussian 

matrix we chose previously. With this choice, we can use the same threshold h that we used in the Gaussian 
analysis. 

Gomputer experiments suggest that our proposed structured mapping approximates the behaviour of the 
Gaussian mapping that we prove in our main theorems. Figure ^b\ illustrates not only that sparsity grows 
at a rate comparable to the /^/[ogm) rate realized by the Gaussian mapping, but bears the same 

absolute values. 

Figures demonstrate similar behaviour of our Gaussian and structured mappings in the model case 
where we have a corpus of a single document. We query X = {x} using vector setting (x, y) = A — e“ 
to examine type I error and (x, y) = A + e+ when examining type II error. Queries are iterated over many 
random transforms (as opposed to many random x^y) to measure error rates. 

Note that choosing d = 2 in simulating the Gaussian mapping gives the same result as larger d: all 
that we require for this mapping is that each mapped element {oi^x) ^ A^(0,1). For testing the structured 
mapping, we select dense x^y of modest dimension d = 100. 

We do not believe that the particular F, D we choose to construct our structured random mapping are 
the only possibilities. In particular, we imagine that the speed of at least the Fourier transform step can 
be increased by leveraging recent work in randomized Fourier transforms that realize logm log(m/d)) 
complexity of determining the k larges entries of the Fourier transform [8]. 

However, we caution the reader that even if normalized correctly, the expansion u = Dx cannot be 
arbitrary: In one of our early attempts, we chose D G a matrix with exactly one random ±1 in each 

column and at most one non-zero in each row. Although the sparsity of the resulting map (not illustrated) 
is comparable to the sparsity of the Gaussian map, the asymmetry between simulated type I and type II 
errors shown in Figures and indicate output biased towards moving relevant documents apart from 
each other. We conjecture that for an unbiased transform, the vector we input to the Fourier transform must 
be dense with elements having mean zero. However, a study of the class of structured random mappings 
which approximate the behaviour proven in our main theorems remains, for the time being, future work. 


5 Experiments 

We now evaluate the performance of the random mapping approach using two different datasets: search 
based on the color of Wikipedia images, and Dow Industrial market data based on closing value percentage 
differences going back to the inception of this index. Searches take a seed item as a query, and attempt to 
find other items having similar features. We note that our two examples are from remarkably different areas 
of study: all that is required is that corpus documents be represented by vectors in R^. In both cases, we 
evaluate recall and precision as well as ranking of top search results. Below, relevant is the set of all relevant 
documents in the corpus and retrieved is the set of all retrieved documents in the corpus. Precision and 
recall scores are calculated as: 

\relevant r\retreived\ „ \relevant r\retrieved\ , . 

precision = -,---, recall = -^^^- (5.1) 

\retrieved\ \relevant\ 
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Type I error Type I error Type I error 


0.6 


Gaussian: 771 = 100000, d=2, A =0.9 


0.6 


Gaussian: m = 100000, d=2, A =0.9 


- r=0.2 



Structured: m =100000, d=100, A =0.9 


- r=0.2 



Biased Structured: m = 100000, d = 100, A =0.9 

- r=0.2 

- r=0.4 



- r=0.2 



Structured: m =100000, d = 100, A=0.9 

- r=0.2 



(d) Structured type II error 


Biased Structured: m = 100000, d = 100, A =0.9 



(f) Biased structured type II error 


Figure 4: Comparison of Gaussian transformation supported by theory and simulated results of the structured 
mapping for a random 1-element corpus. 
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5.1 ImageCLEF Wikipedia image corpus 

For its intuitive evaluation and ubiquity, our first test is a search of images of common color in the ImageCLEF 
2010 Wikipedia Collection [16]. We process each image by transforming images to HSV colorspace, and 
binning the pixels of each image into histograms. Two example searches are shown in Figure For these 
searches, we use other images as the queries, and so the first step in each search is to extract the color 
histogram from the query image before transforming it. 

Results for histograms of 128 color bins are shown. We achieve good results with corpora represented by 
32 and 64 bins. However, as shown by the search-time comparison in Figureour method does not penalize 
large document vectors, and we choose the higher-resolution histograms. We also choose our structured 
random embedding over the Gaussian transformation at no loss in functionality. Note that the histograms 
accompany each image in Figurej^so that errors due to our transform may be distinguished from quantization 
error due to color histogram binning. 

Figurej^gives a more quantitative representation of performance as the precision-recall curve. To generate 
each point on these curves, we fix {x^y) = A our threshold for relevant documents, and S{x,y) = /i(A) our 
threshold for returned documents. Error bars representing the standard error of precision and recall are 
generated for each point by sampling over may query images. Naturally, precision increases with A, but most 
importantly, the area under the mean precision-recall curve is nearly 1, showing that our method tends to 
preserve the ordering of ranked results. 

5.2 Dow Industrial corpus 

Inspired by the finance literature employing nearest neighbour estimates for forecasting markets, the second 
test of our method draws on the Dow Jones Industrial Average m- Each vector of this corpus is a 10- 
element vector of relative differences between closing index values for each of preceeding 5 and succeeding 5 
trading days. “Bullish” elements are characterized by large positive differences, while “bearish” differences 
are characterized by large negative differences. A search, thus, takes a single day as query and finds days 
with similar time-local trading patterns. Example results of two searches, queried by each of a bearish and 
bullish day, are shown in Table and Table 

This second example has two appealing features: Eirst, it is not an image corpus, illustrating that our 
method is not an image search method (though it may find application there), but is a search for any data 
well-represented by vectors in R^. Second, as shown in Eigure the statistics of these Dow data differ 
significantly from those of the image data. Despite these different statistics, the precision-recall curve for 
the Dow Industrial data, Eigure shows the ranked search performance to be comparable to that for the 
image data. 

5.3 Online resources 

See https://gitlab.com/dgpr-sparse-search/code for code for the simulations in Section and the 
search demonstrations. The code for this project is written in Python, with simulations appearing as IPython 
notebooks. Whereever possible, custom code is avoided in favour of off-the-shelf open-source projects. Search 
examples use the Django web framework, and are powered by the Whoosh search package. As evidenced 
by the long search times (median ^ 500 ms for our set of 270K images). Whoosh is not the fastest off-the- 
shelf search engine. Rather, we select it for its ease of configuration and structural similarity to compiled 
off-the-shelf search engines such as Apache Lucene and ElasticSearch. 
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Results 


Search Time: 0.57 seconds 



(a) Mediocre result 

Results 


Search Time: 0.60 seconds 



(b) Good result 


Figure 5: Example searches for images of similar color over the Wikipedia ImageCLEF corpus. The top left 
is the query. Note that even in the case of the visually mediocre result, the match of the histograms is good. 
The medicore result is mediocre because of the lack of good color resolution in the HSV bins for this query. 
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Search Time (s) 


Figure 6: Empirical cumulative distributions of search times for each of the three color histogram binning 
schemes, namely, d = 32,64,128. Search times have similar statistics, independent of d. 


Precision vs Recall: m =30000, r=0,2, A =0.8 



Recall 


Figure 7: Precision-Recall curve for 128-bin histograms showing the performance of the sparse transform 
parameterized by T, the threshold for relevant results. Error bars represent a range of performance over 
randomly sampled query images. 


14 

























Date 

% Change 

True score 

Vector 
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1.0 

-— 

1957-05-28 
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- 
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-— 
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1971-11-02 
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- 
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0.6 

0.843 

- 

1985-08-19 

-0.017 

0.868 

-- 

1929-10-29 

-11.729 

0.865 

-- 

1896-07-21 

0.59 
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Vector 

2008-10-28 

10.878 

1.0 


1980-12-12 

0.958 
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Table 2: Query is the Dow boom of 2008-10-28 (top 
row). 


Precision VS Recall: m =30000, r=0,2, A =0.8 


i=i 

o 

o 



Figure 8: Precision-recall curve for the Dow data parameterized by T, the inner product threshold for 
relevant results. 


15 



























(a) Wikipedia ImageCLEF 128-bin color histograms, (b) Dow Industrial daily differences, 10-day window. 

Figure 9: Comparison of the statistics of the Wikipedia ImageCLEF and Dow Jones Industrial corpora. 
Each plot shows the density of the projection of the corpora onto their first two singular vectors. 
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A Proofs 


A.l Proof of Lemma 13.11 

We will need the following non-asymptotic version of the Gentral Limit Theorem [23] . 

Theorem A.l (Berry-Esseen Gentral Limit Theorem). Let zi, Z 2 ,..., Zm be independent, identically dis¬ 
tributed, mean-zero, random variables satisfying Ezf = cr^. Set 


Sm = and p = E\zi\^ 

a Cm 

Let Fm{t) be the cumulative distribution function of Sm- Then for all t and m. 


\Fm{t)-m\<co- 


(A.l) 


where Co = 0.4748. 

to characterize the rate at which S{x,y) converges to a standard normal 

random variable. 


We may leverage Theorem 


A.l 


Lemma A.2. Let everything be as in Lemma 3.1 but with no restriction on X (aside from X G [—1,1]^. 
Note that the distribution of S{x,y) depends only on m and X, and define Fm,x{t) := P{S{x,y) < t) to be 
the corresponding cumulative distribution function. Then, 


\FmAt)-^{t)\ < 


\/Kr)m 


for all t E R. 


(A.2) 
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Proof. We apply Theorem A.l to S{x,y). Thus, let Zi = —/i(A), and note that the normalized 

score satisfies 

S{x,y)= 


a{X)y^' 


It is not hard to bound p as follows 


p := E\zi\^ < p{X){l - p{X)) = a^{X). 

Thus, the right-hand side of the Berry-Esseen bound Equation P3_ is less than l/(cr(A)y^). Eurther, 
/i(A) < P{N{0, l)>h) < P{N{0, 1) > 0) = 1/2 and thus, cr(A) = V/^(A)(1 -/i(A)) > x/y\x)/2. 

implies that for all t and m, the cumulative distribution function of S{x^y) satisfies 


Then Theorem 


A.l 


iru /X M 0.4748\/2 




This result quickly translates into a proof of Lemma 

oof of Lemma \S 
need to show that 


Proof of Lemma \3.1\ Case 1 has already been proven above. Case 2 follows from from Lemma [Ai^ we only 
1 




0 for A G (2r — 1,1). This follows from Equation (3.5). 


A.2 Proof of non-asymptotic main theorem: Theorem 3.4 


We begin with a few lemmas, which determine the behaviour of the image search procedure non-asymptotically, 
and in the simple case when \X\ = 1. 

Lemma A.3. Fix x, y ^ , and let \= {x,y). FixXe[—l,l]. 

The probability that S{x^y) > /i(A) (i.e., the event that we return x) satisfies the following bound: 


P{S{x, y) > p{X)) - P ( 1 ) > ■ ^/m 


Proof. 


P{S{x,y) > p{X)) = P S{x,y) > 


cr(A') 

p{X) - p{X') 


< 


y/p(y)m 


(A.3) 


Now apply Equation (|A.2|) of Lemma |3.1| to complete the proof. 


r(A0 


We may synthesize this result to give an interval around A outside of which one would expect to return 
(and not return) precisely the desired documents. 


Lemma A.4. Fix ^ G G [—1,1] and 77 > 0. Define e and e+ as in Theorem 3.4' Consider the 

(good) event •={^/ (^, 7 /) > A + then S{x, y) > ia{X); if (x, y) < X — e~, then S{x, y) < /i(A).} Then 

\suvP{Gl)-P{N{Q,l)>y)\< 


xj /i(A — e )m 


Proof. We begin with the following observation about the behaviour of the score. Eix x and x' with A = {x^y) 
and A' = {x'^y). Suppose that A' < A. Then S{x^y) probabilistically dominates S{x',y), i.e., for any t e R 


P{S{x',y) >t)< P{S{x,y) > t). 


(A.4) 


The above is a simple consequence of the fact that mS{x',y) and mS{x^y) are both binomially distributed, 
with respective means of mii{X') and mii{X)^ and /i(A') < /i(A). 
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We will use this observation below, but first fix x with {x^y) = X — e . Then, by Lemma A.3 

1 


PiGl)>P{N{0,l)>rj) 


/i(A — e )m 
We will now show that for any other vector x' 


and P{Gl) < P(A^(0,1) > + 


1 


y/fl{X-e )l 


= . (A.5) 


\/M(A-e )i 


which will complete the proof. Thus, fix x' with {x'^y) = A'. 

There are four cases to consider: 

Case 1: A' < A — e~. Since S{x,y) probabilistically dominates S{x',y), we have 

P{Gl,) < P{Gl) < P{N{0, l)>v)+ . , 

y/i(A — e )m 

as desired. 

Case 2: A' G (A — e“, A + e+). Then, clearly. 


P(G^,) = 0. 


Case 3: A' = A + e+. If A' > 1, then clearly P(G^,) = 0 since {x^y) is always bounded by 1. Thus, suppose 
A' < 1. Then, by Lemma [A. 3 [ 


P{GI,)<P{N{0A)>V)P 


Y^/i(A + e+)m 


<P(Ar(0,l) >7^) + 


\/M(A-e )r 


since y is monotonically increasing. 

), we may reduce to the situation 

of Case 3. 


Case 4: A + e+ < A' < 1 By the probabilistic domination of Equation (A.4 


Our main non-asymptotic theorem follows directly. 

Proof of Theorem \3.4\ Note that 

(Number of type I errors) + (Number of type II errors) = E 

fcGA’ 

The proof then follows by taking the expectation and then the supremum and then dividing by n. ■ 


A.3 Proof of asymptotic main theorem: Theorem 3.2 


Once again, this will come from manipulating the result of Lemma A.3 We have the following asymptotic 
characterization of the parameters of this Lemma. 


(A.6) 


Lemma A.5. Let A G (2r — 1,1) also satisfy A > 0 and set e as in Theorem \3.^ Let 

ti{X) - ti{X - e) _ , __ ii{X) - ii{X + e) 


7 ] := 


r(A - e) 


m, 


r]~^ := 


r(A + e) 


m 


Then 

Furthermore, 


f]^ rsu f] r 

1 


^ T] . 

= ^ 0 . 
m 


19 























Remark A.6 (A note regarding parameters in Theorem 3 . 4 ). In passing, we note that the asymptotie 


equivalenee 77+ ^ 77 ^ 77 ^ eombined with the faet that e is proportional to rj, may he manipulated to show 
that e ^ r 


, where e+,e are defined in Theorem 3.4 Thus, sinee e^t), we also have e+,e 


0 . 


Proof. We write = e to emphasize dependence on m. We will show that 77 ^77. The argument that 

77 + ^ 77 is quite similar. We control the numerator of 77“ via the first order Taylor approximation in 


/i(A) - /u(A - €m) = + O(e^) sup 


tG [A—,A] 




(A. 7 ) 


Note, while ^ 0 , it is not apriori obvious that the first term dominates asymptotically since /i(t) also 
depends on m. However, this will become apparent with a bit of calculus. 

We begin with the observation that 


d , . 1 

— u(t) = - , 

dt 27r\/l — t‘^ 


exp 


-h^ 
1 +1 


which may be found in m- A bit of calculus then gives the second derivative 

h^ t 


d? 




l-t2 


d , , 


Note that this is positive for t > 0 , and thus the first derivative is increasing. Further, for m large. A —> 0 
since A > 0 by assumption and ^ 0 . We use these observations to develop the above equation into a 
bound on the supremum of the second derivative 


sup 

te[X-em,X] 


df 

dp"'*’ - 




A 


1-A2 




The first-order Taylor approximation (A. 7 ) then becomes 


/u(A) - yu(A - em) = + 0{emh‘^)emflJ-{X). 


Thus, since 0 , we have 


li{X) - ii{X - em) ~ <^rnfKX) = 


exp 


-h^ 

1 + A 


(A.8) 


Thus the numerator of 77 is asymptotically equivalent to the right-hand side of the above expression, 
multiplied by ^/rn. Let us also note that Equation ( |A.8[ ) combined with Equation ( 3 . 5 ) imply that 


/i(A - Cm) ^ /i(A). 

We now move to the denominator of rj~, that is, a{X — e^) = /i(A - 


(A. 9 ) 


^)(l — /i(A — em)- It is not hard 


to show that /i(A — e^) ^ 0 , and thus cr(A — e^) 
Thus, by Equation (A. 9 ), we have 


y/x(A - em)- 


r(A - e) ~ -\//i(A) 


(1 + A)^ 


27r/i2-\/l — A2 


exp 


/l2 

TTa 


(A.IO) 


Now we have shown that the right-hand side of Equation (A.8), multiplied by is asymptotically equiv¬ 
alent to the numerator of 77“ and the right-hand side of Equation (A.IO) is asymptotically equivalent to the 
denominator. If you divide the former by the latter, you get 77, thus showing that 77“ ^ 77 as desired. 


We complete the proof of the lemma by showing that 


this quantity is asymptotically equivalent to 


■\f fi(X—e)m 


0 . Eirst, Equation (A. 9 ) imply that 


to 0 based on Equation ( 3 . 5 ). 


\/uix)7 


= . As we have seen before the latter quantity converges 
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The above lemma implies the following result when |T| = 1. 

Lemma A.7. Fix y G G (2r — 1,1) and rj > 0. Let e satisfy Equation ( |3.7| ). Consider the (good) 

event Gx {x^ y) > A + e, then S{x, y) > /i(A); if (x, y) < X — e, then S{x, y) < /i(A).} Then 

lim sup P{G%) = P{N{0, 1 ) > 77 ). 


Proof. By following the same steps as in the proof of Lemma A.4 we have 

sup P(G^) > P{N{0, 1) > r) - , , 

ass'*-! ^^/|I{X-€)m 

sup PiGl) < P{N{0, 1) > niin(? 7 “,? 7 +)) + ) 

ajsS'i-i V - e)m 

The proof of the lemma now follows from the continuity of P(A(0,1) >t) as a function of t, combined with 
the asymptotic characterization of parameters in Lemma 1^ ■ 

We are now in position to prove our main non-asymptotic theorem. 


Proof of Theorem This is precisely the same as the proof of Theorem |3.2[ Note that 


(Number of type I errors) + (Number of type II errors) = E 

x^X 

The proof then follows by taking the expectation and then the supremum and then dividing by n. ■ 
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